ABSTRACT

Data labeling is a necessary but often slow process that
impedes the development of interactive systems for modern
data analysis. Despite rising demand for manual data
labeling, there is a surprising lack of work addressing its
high and unpredictable latency. In this paper, we introduce
CLAMShell, a system that speeds up crowds in order to
achieve consistently low-latency data labeling. We offer a
taxonomy of the sources of labeling latency and study several
large crowd-sourced labeling deployments to understand
their empirical latency profiles. Driven by these insights, we
comprehensively tackle each source of latency, both by
developing novel techniques such as straggler mitigation and
pool maintenance and by optimizing existing methods such
as crowd retainer pools and active learning. We evaluate
CLAMShell in simulation and on live workers on Amazon’s
Mechanical Turk, demonstrating that our techniques can
provide an order of magnitude speedup and variance
reduction over existing crowdsourced labeling strategies.

1. INTRODUCTION 

Modern data analysis is fundamentally centered around the
human analyst and her ability to rapidly iterate between
hypotheses and evidence. Towards this goal, numerous projects
have optimized individual data analysis components (e.g.,
data ingest [1, 42], data analytics [14, 38, 47, 56, 55],
visualization [53, 52, 34, 26] and predictive models [16, 39,
25, 18]) as well as multi-stage workflows [20, 29] to reduce
the end-to-end latency of data analysis.

Unfortunately, these advances continue to be hindered by
the need for synchronous human effort, often in the form
of manual labeling. For example, human workers are
frequently tasked to label training data (e.g., sentiment
analysis, user preferences) for machine learning models.
Similarly, many data cleaning systems [17, 51, 48, 27] rely on
crowd workers to provide labels for entity resolution, value
imputation, and other error mitigation algorithms. In fact, a
recent survey of software companies [36] found that these
companies use crowd workers to complete hundreds of
thousands of data cleaning tasks per day. Such heavy reliance
on manually generated data inevitably limits the speed of
analysis pipelines by the latency of their crowdsourcing steps.

All crowd-based data labeling systems seek to reduce cost
and latency while maximizing quality. However, most research
has focused only on the trade-off between quality and cost,
with work on crowdsourcing routinely reporting latencies
on the order of minutes to hours to complete an average
task [6, 30, 15], which is clearly unacceptable for user-facing
data systems.

In this paper, we explicitly tackle the trade-off between cost
and latency for crowd-sourced labeling tasks. Though there
are a few existing works that explicitly aim at tackling
latency, they are either tailored to specific tasks [35, 37, 49],
targeted towards a single source of latency such as
recruitment time [5, 8], or focused on machine learning
techniques (e.g., active learning) that ignore the practicalities
of live crowdsourcing and may be counterproductive in terms
of wall clock latency [40].

In addition, predictability of overall task latency is an
important consideration that has not been carefully studied.
Depending on numerous external factors, the quantity,
quality, and speed of available workers on crowd platforms
such as Amazon’s Mechanical Turk (MTurk) [41] can
fluctuate wildly [23, 22] and result in individual task latencies
from seconds to even days. We argue that the variance of
task latency must be within single-digit seconds before
crowd labeling can be embedded in interactive user-facing
applications such as Data Wrangler [27].

In this paper, we introduce CLAMShell, a system that speeds
up crowds in order to achieve consistent, low-latency data
labeling. Rather than focus on a single algorithm or step
in the data labeling lifecycle, our goal is to develop a
collection of pragmatic techniques to clamp down on latency
and variance during all stages of labeling. To this end, we
first perform an empirical study of the dominant sources of
latency: per-task latency, batch-wise latency, and end-to-end
overall latency. We then systematically address each
major source through three novel techniques. Straggler
mitigation uses redundant labelers to mitigate ‘straggler tasks’
at the end of batches, decreasing the variance of batch
labeling time from minutes to fractions of seconds. Pool
maintenance uses threshold-based eviction techniques to
maintain a pool of fast, high-quality workers and decrease the
average time to label each task. Hybrid learning combines
active and passive learning to exploit crowd pool parallelism
when there are more workers available than the active
learning batch size, and dynamically favors passive learning
on datasets where active learning performs poorly. Our
evaluation of CLAMShell on live workers demonstrates up to
8x speedups in label acquisition time and over 2 orders of
magnitude reduction in variance compared to typical
non-optimized deployments. A key benefit of our work is that
all of these optimizations are compatible with standard
quality control algorithms such as redundancy-based voting
schemes and worker quality estimation algorithms.

2. STUDYING CROWD LATENCY 

In this section, we categorize the primary sources of
crowdsourced microtask latency, describe existing work that
addresses crowdsourcing latency, and outline our approach
towards a comprehensive solution. We include a study of one
crowd-labeling MTurk deployment that ran ~60,000 tasks
to label medical publication abstracts. A full analysis of this
and three other microtask deployments can be found in our
technical report.

2.1 Sources of Latency 

A multitude of factors can increase latency, from algorithm 
choice to worker and environmental factors. We find that 
categorizing the factors based on the granularity of work 
provides a clear decoupling of algorithmic contributions from 
systems concerns. Specifically, latency might arise from the 
speed of a single task, a fixed batch of tasks, or the full run 
of multiple batches (of possibly varying sizes). 

Per-Task Latency We can view the latency of a single task 
as a linear sequence of three phases: 

1. Recruitment: Workers do not immediately begin working
on newly submitted tasks, and recruitment latency
consists of the time until an interested crowd worker
accepts a newly posted task. In the medical deployment,
the minimum, median, and standard deviation of
recruitment latency were 5, 36, and 9 minutes, respectively.

2. Qualification and Training: Once workers accept a
task for the first time, they are often presented with
tutorials or qualification tasks before they are permitted
to perform actual work.

3. Work: The amount of time a worker spends to complete
a task can vary depending on the worker’s competency,
the time of day, fatigue, and numerous other
factors [32, 23]. Note that a single task may produce
multiple labels if records are grouped into tasks (a
common practice).

Per-Batch Latency We define the batch latency as the
time for all tasks in a fixed-sized set to fully complete when
sent to a crowd, which is dependent on the latency
distribution of all available workers in addition to each
worker’s individual variations.

For example, in the medical deployment, the median and
standard deviation to complete a given HIT were 4 and
2 minutes, respectively, while the 90th percentiles are
upwards of 1.1 and 3 hours, respectively. Although each HIT
produces multiple labels, this extreme long-tail distribution
is commonplace on microtask platforms like MTurk, and is
driven by three sources:

1. Stragglers: The batch must block until the slowest
task is completed, which can be up to 3 orders of
magnitude slower than the median.

2. Mean Pool Latency (MPL): The expected latency
depends on the MPL, which varies from 1 to 2 minutes.

3. Pool and Worker Variance: The long tail ultimately
results in high variance within and between batches.
The most and least consistent workers had standard
deviations of 4 minutes and 2.7 hours, respectively.

These sources contribute to task response times that are, in 
practice, slow and extremely variable. 
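
The blocking behavior behind the straggler effect can be
illustrated with a small simulation. The sketch below is not
the simulator used in this paper: it assumes, purely for
illustration, that task latencies are log-normally distributed,
and the function names and parameter values are hypothetical.

```python
import math
import random
import statistics


def sample_task_latencies(n_tasks, median_s, sigma, rng):
    """Draw long-tailed task latencies in seconds (log-normal is an assumption)."""
    mu = math.log(median_s)  # the median of a log-normal is exp(mu)
    return [rng.lognormvariate(mu, sigma) for _ in range(n_tasks)]


def batch_latency(task_latencies):
    """A batch blocks until its slowest task completes."""
    return max(task_latencies)


rng = random.Random(0)
batches = [sample_task_latencies(10, median_s=240.0, sigma=1.0, rng=rng)
           for _ in range(1000)]
task_median = statistics.median(t for batch in batches for t in batch)
batch_median = statistics.median(batch_latency(batch) for batch in batches)
print(f"median task latency:  {task_median:.0f} s")
print(f"median batch latency: {batch_median:.0f} s")
```

Because the batch latency is a maximum, a heavier tail or a
larger batch pushes the batch median further above the task
median.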

Full-Run Latency Rather than require crowd workers to
label terabytes of data, machine learning is often used to
infer labels once enough records have been labeled to train
a high-quality model. Active learning can reduce the size
of this training set; however, training the model requires
acquiring small batches of labels in a blocking fashion. This
induces four latency sources:

1. Decision Latency: The time to pick the next batch of
tasks (e.g., uncertainty sampling for active learning).

2. Task Count: The number of labeling tasks, which
machine learning approaches seek to reduce.

3. Batch Size: The batch size affects both active learning
convergence and the amount of parallelism within
a batch.

4. Pool Size: The number of workers completing tasks
controls the maximum parallelism possible, but is
often dictated by operational constraints.

Active learning can drastically reduce the task count, but
incurs increased decision latency and requires limited batch
sizes to be effective. In contrast, passive learning can
leverage the parallelism of all available workers, but might
require many more tasks to train a model of equivalent
accuracy. The choice ultimately depends on the labeling task,
as we show empirically in Section 6.5.

2.2 Tackling Latency 

Task Latency       Batch Latency        Full-Run Latency
-----------------  -------------------  ----------------
Recruitment*       Stragglers           Decision Time
Qual & Training    Mean pool latency    Task Count*
Work*              Pool variance        Batch Size
                                        Pool Size

Table 1: Classification of sources of latency in data labeling.

Table 1 summarizes the sources of latency described in the
previous section, and notes (*) sources that have been
addressed in the literature. From the table, it is clear that
there is ample opportunity to improve the state of
crowdsourced latency.

Existing Literature The primary body of existing work
addresses recruitment time, a dominant source of task latency.
Bigham et al. [8] frequently repost tasks (among other
techniques) to improve the chances of workers accepting their
tasks. However, if widely adopted, such techniques would
likely exacerbate recruitment time. Bernstein et al. [5, 7]
proposed the retainer model, which pre-recruits a pool of
crowd workers (a retainer pool) and pays them to stay and be
ready to accept tasks. In settings where tasks are streaming
or come in batches, this model can effectively eliminate
recruitment time at a small cost. In our work, we build on top
of the retainer model.

Work time has been reduced by re-designing task
interfaces [35]. For example, Marcus et al. [37] study join
interfaces for images, and design interface batching techniques
that let workers complete up to 9 pair-wise comparisons
in the same time as a single pair-wise comparison task.
However, these approaches are task specific, so CLAMShell
views them as complementary to its general task
optimization framework and does not explicitly address them.

Finally, algorithmic analysis and machine learning have been
used to reduce task count. The former focuses on efficient
algorithms for specific operations (e.g., entity resolution [50],
counting [35], or information retrieval [12, 43]). These focus
on full-run latency, and could leverage CLAMShell’s
per-task, per-batch, and machine learning techniques.

The latter trains models using data from completed tasks
until the prediction quality exceeds a user-defined threshold,
and the model is then used to predict the remaining
responses. In this setting, active learning [11] is a commonly
used method [40, 17]. Given unlabeled data, active learning
iteratively uses a point selection algorithm to pick a small set
of informative points to acquire labels for, and incorporates
the new labels into its model. The algorithm continues until
the model accuracy (e.g., cross-validation) converges. Active
learning is indispensable when there are more items than can
be practically labeled, and can be used in conjunction with
algorithmic approaches that rely on the labels [12, 50].

Despite reducing the task count, active learning may
counterintuitively increase the overall latency by constraining
the parallelism due to its batch size limitations. Its
convergence properties have only been proved when the batch
size is 1, and larger batch sizes (e.g., 10) have only been
tested empirically. When the number of workers significantly
exceeds the batch size, active learning can be much slower
than labeling as many random tasks in parallel as possible
and using a passive learner.
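
A back-of-the-envelope model makes this trade-off concrete.
The sketch below is an illustrative assumption rather than a
model from this paper: it treats every wave of parallel tasks
as taking the same time, assumes one label per task, and uses
hypothetical label counts for the active and passive learners.

```python
import math


def full_run_latency(n_labels, batch_size, pool_size,
                     batch_latency_s, decision_latency_s=0.0):
    """Rough wall-clock estimate for acquiring n_labels in blocking batches."""
    n_batches = math.ceil(n_labels / batch_size)
    # A batch larger than the pool is worked off in several waves.
    waves_per_batch = math.ceil(batch_size / pool_size)
    per_batch = waves_per_batch * batch_latency_s + decision_latency_s
    return n_batches * per_batch


# Hypothetical numbers: 50 workers, 60 s per wave of tasks.
active = full_run_latency(n_labels=200, batch_size=10, pool_size=50,
                          batch_latency_s=60.0, decision_latency_s=5.0)
passive = full_run_latency(n_labels=500, batch_size=50, pool_size=50,
                           batch_latency_s=60.0)
print(active, passive)
```

Under these assumed numbers the passive learner needs 2.5x as
many labels yet finishes sooner, because the active learner
leaves 40 of the 50 workers idle in every batch.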

Towards a Comprehensive Solution The core problem
is a trade-off between cost and latency:

Problem 1 (The Crowd Labeling Problem). A user
wants to label N items using a pool of p workers at an
accuracy level of a (e.g., a% of all items are labeled
correctly). Minimize a metric that combines l and c, where
l is the latency to label the items, c is the total cost, and
β is a user-specified parameter expressing a preference for
speed versus cost.

To this end, we systematically tackle the primary sources of 
latency (Table 1) in a general purpose labeling system: 

1. Task Latency: CLAMShell addresses task latency by
adopting retainer pools to reduce recruitment costs.
CLAMShell automatically maintains the pool size at
p as workers abandon the pool, and provides guidance
about how the cost and latency will be affected by
changing p. In addition, CLAMShell trains and verifies
worker qualifications as part of recruitment, ensuring
that every worker in the pool is immediately available
to provide useful work when new tasks arrive.

2. Batch Latency: Straggler mitigation uses worker
redundancy on slow tasks to compensate for long-tail
latencies. Pool maintenance selectively replaces pool
workers to progressively shift and tighten the latency
distribution towards faster responses. Together, they
eliminate straggler effects, reduce mean pool latencies
over time, and significantly reduce batch variance.

3. Full-Run Latency: CLAMShell uses a hybrid strategy
that allocates subsets of the worker pool to active and
passive learning. In addition, CLAMShell pipelines the
expensive model retraining and uncertainty sampling
steps with crowd labeling to eliminate decision latency
at the cost of slightly stale model results.

Figure 1: CLAMShell architecture diagram.

Note that CLAMShell does not explicitly address work time,
nor pool size: work time is often specific to the task
interface, which we view as an orthogonal interface
optimization problem, and pool size is a parameter to The
Crowd Labeling Problem and typically set by operational
constraints. Instead, the following text focuses on the other
sources of latency listed in Table 1.

3. THE CLAMSHELL SYSTEM 

In this section, we present an overview of CLAMShell, a 
system for fast label acquisition. 

The CLAMShell architecture is illustrated in Figure 1. The
user submits a set or stream of labeling tasks to the Batcher,
which uses the Task Selector (Section 5.1) to pick B
incomplete tasks to process in the current iteration. The tasks
are selected via uncertainty sampling using the most recently
trained model to pick tasks that benefit active learning, and
random sampling to pick tasks for passive learning. The
resulting batch is sent to LifeGuard, which schedules tasks
within the batch to be sent to the Crowd Platform. This
level of indirection is necessary when the batch size exceeds
the size of the retainer pool, and so that the Mitigator can
control redundancy when there are slow tasks.
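
The two sampling modes used for task selection can be
sketched as follows. This is a minimal illustration, not
CLAMShell’s implementation: the function name, the fixed
split between the two samplers, and the restriction to binary
labels (uncertainty measured as distance from 0.5) are all
assumptions.

```python
import random


def select_batch(p_positive, n_active, n_passive, rng):
    """Split one batch between uncertainty sampling and random sampling.

    p_positive maps task id -> the current model's predicted probability of
    the positive class (binary labels are an assumption of this sketch).
    """
    # Uncertainty sampling: predictions closest to 0.5 are least certain.
    ranked = sorted(p_positive, key=lambda task: abs(p_positive[task] - 0.5))
    active = ranked[:n_active]
    # Random sampling over whatever the active learner did not pick.
    chosen = set(active)
    remaining = [task for task in p_positive if task not in chosen]
    passive = rng.sample(remaining, min(n_passive, len(remaining)))
    return active, passive


predictions = {"a": 0.51, "b": 0.95, "c": 0.40, "d": 0.05, "e": 0.70}
active, passive = select_batch(predictions, n_active=2, n_passive=2,
                               rng=random.Random(0))
print(active, passive)
```

Here tasks "a" and "c" are picked for active learning because
their predictions are nearest 0.5; the passive tasks are drawn
at random from the rest.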

The Crowd Platform holds a set of slots (S1 ... S4) in the
current retainer pool. Each slot corresponds to a persistent
retainer task that a crowd worker has accepted, and may be
empty (e.g., S4) or contain a task (e.g., T0). The Scheduler
immediately sends new tasks to available slots (e.g., S3).
If all tasks have been sent, then the Mitigator sends
duplicate (mitigation) tasks for slow, incomplete tasks (e.g.,
S1). If a slot is consistently performing slowly, the Maintainer
may recruit and train a worker for a replacement slot in the
background, and evict the slot (X in S4) when the new one
is available.

Completed labels are sent directly to the Batcher, which
retrains the machine learning model. The Task Selector uses
different sampling algorithms such as uniform sampling,
active learning-based uncertainty sampling, or our hybrid
sampler, to pick the next batch of tasks. During the entire
process, the user receives the completed labels, and is able to
query the currently trained model for new predictions. Next,
we show an example to describe the use of CLAMShell in
practice.

Example 1. Imagine a news outlet is covering a live
political debate, and wants to monitor and visualize the
public’s reaction to candidates’ comments on hot-button issues
by analyzing the sentiment of related tweets. Because
automated sentiment analysis techniques on tweets are often
inadequate [2], the company asks a crowd to label tweets as
“positive”, “negative”, or “neutral”. If the system suffered
from high crowd latency, the sentiment visualization would
be unable to keep up with the changes in public opinion as
the debate proceeded, rendering the tool unhelpful.
CLAMShell can be used to address this issue with both
per-batch and full-run optimizations. The per-batch
optimizations, including straggler mitigation and pool
maintenance techniques, are designed to reduce the time that is
required to label a batch of tasks using crowds, e.g., asking
for crowd labels for a batch of ten tweets. Once the company
has enough labeled data, they hope to switch to an automated
process in the long term. The full-run optimizations,
including hybrid learning, are designed to reduce the number
of iterations that a learning model needs to converge, i.e., the
total number of batches that we need to ask crowds to label.

4. PER-BATCH LATENCY OPTIMIZATION

Per-batch optimizations aim to reduce the latency for a
single batch of labeling tasks: the Batcher sends a batch of
tasks to a pool of workers, and waits for the batch of work
to complete. In this model, the dominant costs are due to
the variability of worker latencies within the pool, as well
as the variability within the tasks that a single worker
performs.



Figure 2: Distribution of worker latencies. 

For example, Figure 2 depicts per-worker means and
standard deviations of latency from the medical deployment as
CDFs. We can see that average worker speeds are spread
out from tens of seconds to hours. In addition, even workers
who are very fast on average (~1 minute) can take as long
as an hour or more to complete some tasks. This variation
is bad for per-batch latency because the batch must block
until all of its tasks are complete.

To reduce per-batch latency, a system must therefore
reduce both the mean of the latency distribution (workers who
are slow on average) and its variance (workers who are
inconsistent). This section describes a mechanism for each,
along with mathematical models and simulation results.
Section 6 evaluates these strategies on a live deployment on
MTurk.

Throughout the following sections, we reference experiments
run in simulation. The simulator setup is described in
Section 6.1, but due to space constraints detailed analysis of
the results is omitted and can be found in our technical
report.


4.1 Straggler Mitigation: Reducing Variance 

In cluster computing frameworks such as Hadoop [21] or
Spark [55] where the presence of straggler tasks in a stage
(e.g., reduce stage of MapReduce [13]) can delay downstream
computation, replicating the slow tasks [4, 13, 3, 54] via
speculative execution or task cloning [4] is an effective
countermeasure.

We take a similar replication-based approach to human
stragglers in our crowd pool. We call a worker active if she is
currently working on a task, and available otherwise.
Similarly, a task is either active, complete, or unassigned. By
default, CLAMShell routes only unassigned tasks to available
workers until all tasks are complete. Once all tasks are active
or complete, available workers must wait until the next batch
to receive a task. With straggler mitigation, in contrast, such
workers are immediately assigned active tasks, creating
duplicate assignments of those tasks. CLAMShell returns the
first completed assignment of a task to the user and
immediately reassigns all other workers still working on that
task to a new unassigned or active task (though it pays them
for their partial work on the old task regardless). The effect
of straggler mitigation is that when an inconsistent worker
takes a long time to complete a task, the system hides that
latency by sending the task to other, faster workers. As a
result, the fastest workers complete the majority of the tasks
and earn money commensurate with their speed. For
example, in the medical deployment, the fastest worker
($\mu = 28.5$ seconds) could complete, on average, 8x as many
tasks as the median worker ($\mu = 4$ minutes).
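
The routing rule above can be sketched as a small event-driven
simulation. This is an illustrative sketch, not the simulator of
Section 6.1: it assumes exponentially distributed task
durations around each worker’s mean, duplicates a random
active task, and uses hypothetical names and pool parameters.

```python
import heapq
import random


def run_batch(n_tasks, worker_means, mitigate, rng):
    """Return the time at which every task in one batch has an answer."""
    unassigned = list(range(n_tasks))
    done = set()
    current = {}   # worker -> (assignment id, task) for active workers
    events = []    # heap of (finish time, assignment id, worker, task)
    now = 0.0
    next_id = 0

    def assign(worker):
        nonlocal next_id
        if unassigned:
            task = unassigned.pop()
        elif mitigate and current:
            # Duplicate a random active task (random routing).
            task = rng.choice(sorted({t for _, t in current.values()}))
        else:
            return  # the worker stays available until the next batch
        duration = rng.expovariate(1.0 / worker_means[worker])
        current[worker] = (next_id, task)
        heapq.heappush(events, (now + duration, next_id, worker, task))
        next_id += 1

    for worker in range(len(worker_means)):
        assign(worker)
    while len(done) < n_tasks:
        now, assignment, worker, task = heapq.heappop(events)
        if current.get(worker) != (assignment, task):
            continue  # stale event: this assignment was terminated earlier
        done.add(task)
        # The first answer wins; everyone on this task is freed and reassigned.
        freed = [w for w, (_, t) in current.items() if t == task]
        for w in freed:
            del current[w]
        for w in freed:
            assign(w)
    return now


def mean_batch_time(mitigate, trials=200):
    means = [30.0] * 8 + [600.0] * 2   # hypothetical pool: 8 fast, 2 slow
    return sum(run_batch(10, means, mitigate, random.Random(seed))
               for seed in range(trials)) / trials


print(mean_batch_time(False), mean_batch_time(True))
```

With mitigation enabled, the fast workers pick up duplicates of
the slow workers’ tasks as soon as they run out of unassigned
work, so the batch no longer waits on the slowest worker.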

Simulation. A natural question arises when performing 
straggler mitigation: which task should be assigned to an 
available worker? We ran simulation experiments testing 
several straggler routing algorithms, including routing to the 
longest-running active task, to a random task, to the task 
with fewest active workers, or to the task known by an oracle 
to complete the slowest. 

To our surprise, the selection algorithm did not affect
end-to-end latency, and random performed as fast as the oracle
solution because the fast workers complete tasks so quickly
that they complete almost all of the tasks in the batch
anyway.

A second question is: at what batch sizes is straggler
mitigation effective? We study this in simulation by varying
the pool size to batch size ratio R = (pool size) / (batch size)
using the random selection algorithm and different pool sizes.
The benefit of straggler mitigation comes from its ability to
remove the overhead of slow workers at the end of a batch of
tasks. When R is higher, each batch gains the full benefit of
straggler mitigation and completes at the speed of the fastest
workers; however, the number of tasks completed in each
batch is lower. Conversely, with a small ratio, workers spend
most of their time working on unassigned tasks, and the
impact of straggler mitigation is lessened.

Impact on Crowdsourcing Systems. Straggler mitigation
is a general technique that does not affect the programming
interface of the system it is applied to. It can therefore
be used easily in conjunction with any existing
crowdsourcing system that processes batches of microtasks. One
important benefit of hiding the variance in worker latencies is
that task completion times become much more predictable.
This characteristic is vital to the development of declarative
crowd systems such as crowdsourced query processors,
because optimizers need to be able to accurately estimate the
cost of executing a declarative crowd workflow.

Working with Quality Control. Straggler mitigation is
reminiscent of redundancy-based quality control algorithms
such as [24] or [28] that use votes from multiple workers to
better estimate the true answer. However, straggler
mitigation stops as soon as it has a single answer in order to
return as quickly as possible. A naive combination of
straggler mitigation and quality control might be inefficient.
For example, duplicating a task for straggler mitigation that
requires 3 votes for quality control would create 6
assignments, whereas perhaps only 4 or 5 are necessary to get
3 answers without any straggling tasks. In order to avoid this
effect, CLAMShell decouples straggler mitigation assignments
from quality control assignments. That is, a
quality-controlled task is marked as active until it has received
(say) 3 answers, and straggler mitigation assigns only single
available workers to the task at a time to eliminate stragglers.
In simulation, we find that this optimization can provide up
to 30% per-batch latency improvement in settings where
stragglers are much slower than average workers and most of
the pool is composed of fast workers.

4.2 Pool Maintenance: Better Mean Latency 

Straggler mitigation reduces the variance of task latencies,
but if many workers in the labeling pool are slow on average,
variance reduction will be ineffective at reducing per-batch
latency. To improve the average speed of the pool over time,
CLAMShell uses pool maintenance, a technique that
continuously replaces slow workers in order to converge to a
pool of mostly fast workers. Because a fast pool will label
each task more quickly, pool maintenance reduces per-batch
labeling latency over time.

Our maintenance algorithm takes as input a latency
threshold $PM_\ell$, and continuously releases workers slower
than the threshold asynchronously as labeling proceeds. To do
so, it computes an empirical latency for each pool worker
based on the worker’s completed tasks and flags the worker as
a candidate for removal if the worker’s latency is significantly
above $PM_\ell$
(determined using a one-sided significance test).
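
One way to realize such a check is sketched below. The text
does not specify which significance test is used, so the
one-sided z-test, the 0.05 significance level, and the function
name are assumptions made for illustration.

```python
import math
from statistics import NormalDist, mean, stdev


def is_removal_candidate(latencies_s, threshold_s, significance=0.05):
    """Flag a worker whose mean latency is significantly above the threshold.

    Uses a one-sided z-test on the worker's completed-task latencies; the
    choice of test and the 0.05 level are assumptions of this sketch.
    """
    n = len(latencies_s)
    if n < 2:
        return False  # not enough evidence to judge the worker yet
    avg, sd = mean(latencies_s), stdev(latencies_s)
    if sd == 0.0:
        return avg > threshold_s
    z = (avg - threshold_s) / (sd / math.sqrt(n))
    p_value = 1.0 - NormalDist().cdf(z)  # H1: true mean > threshold
    return p_value < significance


print(is_removal_candidate([100, 110, 95, 105, 120], threshold_s=60))
print(is_removal_candidate([61, 59, 62, 58, 100], threshold_s=60))
```

The second worker has a mean above the threshold but is not
flagged, because one slow task among five is weak evidence of
a slow worker.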

Instead of removing a slower worker before recruiting a
replacement, CLAMShell continuously recruits and trains
workers in the background in order to maintain a reserve of
new workers. Although this might seem costly, pipelining
recruitment means that pool maintenance can proceed
without blocking on worker recruitment, and we find
empirically that the latency savings of pool maintenance
translate to cost savings that overwhelm the cost of
background recruitment (Section 6.2). The removed worker is
paid for their active job (if any), and informed that there are
no more tasks available for the experimental run. They are not
blacklisted, so that future experiments are not biased.

Pool speed convergence. The following model
demonstrates the mean latency to which a maintained pool will
eventually converge. Assume a population of workers with
mean latencies $\mu_i$ following some global distribution $W$
having mean $\mu$, and sample an initial pool $P_0 \subset W$
uniformly at random from $W$. Let $PM_\ell$ be a latency
threshold splitting the distribution $W$ into two parts, with
probability masses $q$ and $1 - q$ above and below $PM_\ell$
respectively. Further, let $\mu_f$ be the mean latency among
fast workers having $\mu_i < PM_\ell$, and let $\mu_s$ be the
mean latency among slow workers having $\mu_i > PM_\ell$.

Then our initial pool has a mean latency
$E[\mu_i] = (1 - q)\mu_f + q\mu_s$. If at each maintenance
step we remove all slow workers having $\mu_i > PM_\ell$ and
replace them with workers drawn randomly from $W$, and let
$P_i$ be the pool after $i$ steps, we see that $P_1$ has mean
latency $E[\mu_i] = (1 - q)\mu_f + (q(1 - q)\mu_f + q^2\mu_s)$,
and in general $P_n$ has mean latency:

$$E[\mu_i] = \left(\sum_{i=0}^{n} q^i (1 - q)\right)\mu_f + q^{n+1}\mu_s
           = (1 - q^{n+1})\mu_f + q^{n+1}\mu_s.$$

We observe that $\lim_{n \to \infty} E[\mu_i] = \mu_f$, that is,
the pool converges to the mean latency of all workers below
$PM_\ell$. This suggests setting $PM_\ell$ as low as possible;
in practice, however, setting the threshold too low leads to
thrashing, as we show in Section 6.2.
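
The closed form can be checked numerically. The sketch below
compares it against a step-by-step update of the slow fraction
of the pool; the function names and the example population
(40% slow workers, mean latencies of 60 and 300 seconds) are
hypothetical.

```python
def pool_mean_latency(q, mu_f, mu_s, n):
    """Closed form: E[mu_i] after n maintenance steps."""
    return (1 - q ** (n + 1)) * mu_f + q ** (n + 1) * mu_s


def pool_mean_latency_stepwise(q, mu_f, mu_s, n):
    """Same quantity, tracking the slow fraction one step at a time."""
    slow_fraction = q                 # the initial pool is a random sample
    for _ in range(n):
        slow_fraction *= q            # only a fraction q of replacements is slow
    return (1 - slow_fraction) * mu_f + slow_fraction * mu_s


# Hypothetical population: 40% slow workers, means of 60 s and 300 s.
for n in (0, 1, 5, 20):
    print(n, round(pool_mean_latency(0.4, 60.0, 300.0, n), 2))
```

The expected pool latency starts at the population mean and
decays geometrically towards the fast-worker mean of 60
seconds.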

Simulation. We simulated how pool maintenance affects
batch latency with respect to the pool size to task ratio R
using a latency threshold $PM_\ell$ of one standard deviation
below the mean. After each batch, we replace all workers
slower than $PM_\ell$ with new samples from the worker
distribution. With pool maintenance, the batch latency falls
quickly, nearly halving in just 15 to 20 batches. When there
are many more tasks than pool workers, the effect becomes
less pronounced, because there are enough tasks that slow
workers who only complete a small fraction of tasks do not
impact the per-batch latency.

To better understand how the distribution of mean worker
latency is changing over time, we simulate the mean pool
latency (MPL) of the worker pool over time with and without
maintenance, and compare the MPL to the mathematical
model’s predictions. With maintenance, the pool’s MPL
converges quickly to the model’s predicted asymptote,
following the model closely across pool-size to task ratios R.

Latency Threshold. The pool maintenance latency
threshold determines which workers are slow and should be
removed from the pool. To pick a good threshold, we can
observe the empirical distribution of all workers ever seen,
and estimate the threshold as k standard deviations below
the mean. The goal is to find a threshold low enough to
decrease average pool latency by releasing slow workers, but
high enough to avoid discarding the fastest workers from
the pool. In Section 6.2, we vary the threshold and find
that it has a significant impact on the benefits of pool
maintenance.
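
As a minimal sketch of this rule, assuming the per-worker mean
latencies observed so far are available as a list (the function
name and sample values are hypothetical):

```python
from statistics import mean, stdev


def latency_threshold(worker_mean_latencies_s, k):
    """Threshold set k standard deviations below the observed mean."""
    return mean(worker_mean_latencies_s) - k * stdev(worker_mean_latencies_s)


observed = [45.0, 60.0, 90.0, 240.0, 600.0]   # hypothetical worker means
print(latency_threshold(observed, k=0.5))
```

A larger k lowers the threshold and therefore evicts more
aggressively, which is the knob that trades latency gains
against thrashing.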

Extensions. As described, pool maintenance is focused
only on reducing the mean latency of the pool. However,
it can be easily extended to optimize for other criteria by
choosing an objective function other than worker speed. For
example, we could maintain a pool using quality (estimated
using, e.g., inter-worker agreement [9]) to converge to a
high-quality pool, use a weighted average to trade off quality
and speed, or minimize another metric such as worker
variance. Ramesh et al. [45] take a similar approach to
identifying high-quality workers, though they use an oracle to
determine accuracy and evaluate their technique only in
simulation.

4.3 Combining Per-Batch Techniques

Both straggler mitigation and pool maintenance deal with
tail latencies: maintenance detects and removes workers
whose average speeds are outliers, and straggler mitigation
hides individual workers’ outlier tasks. From our initial live
experiments, we were surprised to find that naively
combining the two techniques together resulted in zero or even
negative gains as compared to straggler mitigation alone.
For example, the number of workers replaced in each batch
was reduced from ~30 to less than 5 despite similar worker
distributions.

The reason is that straggler mitigation prevents high latency
tasks by terminating the slower replicas. A consequence of
this technique is the lack of high latency tasks, which
artificially skews every worker’s completion times towards the
latency of the fastest workers, and makes directly measuring
true worker latency infeasible. In response, we developed a
simple model called TermEst to estimate the average
latencies of terminated tasks based on the number of times a
worker’s task is terminated.

We assume the worker pool is represented by two workers, a slow
worker $w_s$ and a fast worker $w_f$, that take true latencies of
$l_{s,j}$ and $l_{f,j}$ respectively to complete task $t_j$; our goal
is to estimate the latency of $w_s$'s terminated tasks. Let $w_s$
start $N$ tasks $T_{all} = \{t_1, \ldots, t_N\}$, where
$T_t \subseteq T_{all}$ are terminated, and $T_c = T_{all} - T_t$ are
completed. Let $l_{k,T} = \frac{1}{|T|} \sum_{t_i \in T} l_{k,i}$ be
the average latency for $w_k$ to complete a random task in $T$, and
let $l_k$ be $w_k$'s true mean latency. Assuming that $w_f$ can start
working on $t_j$ at any time after $w_s$ with uniform probability, the
probability that $w_f$ starts early enough to finish and cause $w_s$
to terminate is $\frac{l_{s,j} - l_{f,j}}{l_{s,j}}$. Thus, $w_s$ is
expected to be terminated $N_t$ times after starting $N$ tasks:

$$\sum_{t_i \in T_t} \frac{l_{s,i} - l_{f,i}}{l_{s,i}} \approx \frac{l_{s,T_t} - l_f}{l_{s,T_t}} \times N = N_t$$

Rearranging the terms, we can estimate $l_{s,T_t}$, where
$N_c = N - N_t$:

$$l_{s,T_t} = \frac{l_f \, N}{N_c}$$

We then add a smoothing term $\alpha$ in order to compensate for the
lack of latency evidence when $N$ is small and to avoid divide-by-zero
errors when all of a worker's tasks are terminated ($N = N_t$, so
$N_c = 0$). In practice, we estimate $l_f$ as the empirical mean
latency of the workers that caused any of $w_s$'s past tasks to
terminate:

$$l_{s,T_t} = \frac{l_f \, (N + \alpha)}{N_c + \alpha}$$

Finally, we estimate the overall latency of $w_s$ by taking the
weighted average of $l_{s,T_t}$ and the empirical mean latency of the
tasks $w_s$ is able to complete, $l_{s,T_c}$:

$$l_s = \frac{N_t}{N} \times l_{s,T_t} + \frac{N_c}{N} \times l_{s,T_c}$$


Note that our formulation is equivalent to modifying the la¬ 
tency threshold on a per worker basis. Thus, while changing 
the global latency threshold is important for setting a worker 
replacement rate, this adjustment replaces workers who are 
frequently terminated. 
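
As an illustration, the TermEst estimate can be written as a short
Python sketch. The function name and argument names are hypothetical
and are not taken from the CLAMShell implementation; the sketch
assumes $\alpha > 0$ and treats the completed-task mean as 0 when the
worker has completed no tasks.

```python
def termest_latency(completed_latencies, n_terminated, fast_latency, alpha=1.0):
    """Estimate a worker's mean latency when some tasks were terminated.

    completed_latencies: latencies of the tasks the worker finished (T_c).
    n_terminated: number of the worker's tasks that were terminated (N_t).
    fast_latency: estimate of l_f, the mean latency of the workers whose
        faster replicas caused those terminations.
    alpha: smoothing term (assumed > 0 so the denominator is never zero).
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    n_completed = len(completed_latencies)   # N_c
    n_total = n_completed + n_terminated     # N
    if n_total == 0:
        raise ValueError("worker has not started any tasks")

    # Smoothed estimate of the latency of terminated tasks, l_{s,T_t}.
    terminated_est = fast_latency * (n_total + alpha) / (n_completed + alpha)

    # Empirical mean latency of completed tasks, l_{s,T_c}.
    completed_mean = (
        sum(completed_latencies) / n_completed if n_completed else 0.0
    )

    # Weighted average of the two, weighted by task counts.
    return (n_terminated / n_total) * terminated_est + (
        n_completed / n_total
    ) * completed_mean
```

A worker with no terminated tasks is scored by its empirical mean
alone, while a worker whose tasks are all terminated is scored purely
by the smoothed estimate.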


5. FULL-RUN LATENCY OPTIMIZATION 

In order to eliminate the need to manually label all points in 
a potentially large set, CLAMShell acquires labels for only 
as many points as needed to train a predictive model of suf¬ 
ficient quality, then uses that model to impute labels for 
all remaining points. As described in Section 2, there are 


many factors that influence the latency of the labeling pro¬ 
cess. Relying on learning greatly decreases the task count 
necessary to label the entire dataset, but has implications 
for the decision latency and batch size involved. In partic¬ 
ular, CLAMShell uses active learning techniques to reduce 
the task count even further, but trades this improvement for 
increased decision latency (the learner must choose which 
points to label next) and decreased batch size (active learn¬ 
ing is inherently iterative and cannot label as many points 
in parallel). 

In this section, we describe how CLAMShell ameliorates the 
drawbacks of active learning for low-latency labeling. We 
introduce hybrid learning, a novel technique that combines
active and passive learning to maximize pool parallelism and 
hide the inherent limits of active learning batch size. We 
also describe how CLAMShell leverages existing techniques 
to set an effective batch size for active learning and uses 
asynchronous model retraining to hide active learning’s de¬ 
cision latency. 

5.1 Hybrid learning 

Active learning uses the current trained model to decide 
which points to label in the point selection phase, reducing 
the number of points needing labels in order to train a high- 
quality model. In practice, however, there are two major 
challenges to active learning at low latency. First, at each 
iteration, active learning has a limited batch size—setting 
the batch size too high can cause the model to converge even 
more slowly than passive learning [46]. This limits the wall- 
clock speed at which active learning can proceed. Second, 
when labeling work is challenging, it will be hard to train 
a good model. As a result, the current trained model may 
misguide the point selection phase, and active learning may 
perform poorly, perhaps even worse than passive learning. 
On the other hand, passive learning, which trains a model on
randomly sampled data points, can proceed as fast as the crowd
can label, but it wastes human effort on easy labeling work.

To address these issues, we propose hybrid learning in 
CLAMShell, with the basic idea of maintaining the best 
traits of both passive and active learning, allowing for fast 
model convergence on both easy and hard data labeling 
work. Hybrid learning simultaneously acquires labels using 
the active selection strategy and random sampling, maximiz¬ 
ing crowd worker parallelism and compensating for datasets 
where active learning alone would perform poorly. As a re¬ 
sult, label acquisition can proceed at high speed in spite of 
a low active learning batch size. 

Point Selection. Once a batch size has been selected for active
learning (Section 5.2, below), hybrid learning attempts to maximize
crowd worker parallelism by ensuring that each worker in the pool has
at least one point to label. That is, given a batch size $k$ and a
pool size $p$, hybrid learning uses the active selection criterion to
choose $k$ points for labeling, then randomly selects
$\max(0, p - k)$ points for passive labeling. Because CLAMShell caches
all previously labeled points, if the points chosen for active or
passive labeling overlap, their labels are read from the cache and
additional points are selected for labeling.
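
The point selection step can be sketched in Python as follows. This
is a minimal illustration rather than CLAMShell's code: the function
and its arguments are hypothetical, the `uncertainty` scores stand in
for whatever active selection criterion is in use, and, as a
simplification, already-labeled points are filtered out up front
instead of being detected as overlaps and replaced afterwards.

```python
import random


def select_hybrid_batch(unlabeled, uncertainty, k, p, label_cache, rng=None):
    """Pick k points actively and max(0, p - k) points at random.

    unlabeled: iterable of hashable point ids.
    uncertainty: dict mapping point id -> score (higher = more uncertain).
    k: active learning batch size; p: number of workers in the pool.
    label_cache: container of point ids that already have labels.
    """
    rng = rng or random.Random()
    candidates = [x for x in unlabeled if x not in label_cache]

    # Active part: the k most uncertain candidates.
    active = sorted(candidates, key=lambda x: uncertainty[x], reverse=True)[:k]

    # Passive part: fill the remaining worker slots with random points
    # that were not already chosen actively.
    chosen = set(active)
    remaining = [x for x in candidates if x not in chosen]
    n_passive = min(max(0, p - k), len(remaining))
    passive = rng.sample(remaining, n_passive)
    return active, passive
```

With $k \geq p$ the passive list is empty and the procedure reduces to
ordinary batch-mode active learning.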

Model Retraining. Once a new batch of points has been labeled, hybrid
learning retrains a model on all previously observed labels. These
points come from two sampling distributions: uncertainty sampling
(active learning) and random sampling (passive learning). Currently,
CLAMShell retrains the model on the union of these points without
distinguishing between them, though it does weight points based on
the active-to-passive ratio (i.e., $\frac{k}{p-k}$). If users provide
hints to CLAMShell about how hard their labeling work is (e.g., very
difficult), CLAMShell can adjust these weights accordingly. We leave
the exploration of optimal re-weighting schemes for future work.

5.2 Active learning batch size 

Because the speed of active learning is constrained by the 
size of its batches, setting a good batch size is important 
for fast convergence. Too small, and training will be slow 
because it will take a long time to label all the points. Too 
large, and training will be slow because each batch contains 
less useful points, slowing down convergence to a good model 
(or even converging to a bad one!). The literature provides no
guidance on an appropriate batch size for batch-mode active learning,
assuming that the batch size is chosen by the user in advance.
Chakraborty et al. [10] offer an active learning technique that
dynamically sets the batch size, but it is not generic across
learners and requires knowledge of the labeling time for each
instance. We experimented extensively with the active learning batch
size, and found that once batch size was within a reasonable range
(10-40), there was no significant correlation between batch size and
convergence rates on any single dataset, let alone across datasets.

As a result, we rely on empirical results from our hybrid learning
experiments (Section 6.5) to set an active learning batch size that
works well with our hybrid strategy. Those experiments show that the
fraction of the pool $r = \frac{k}{p}$ allocated to active learning
has a significant impact on the convergence of the learner, and that
$r = 0.5$ is a reasonable value for multiple datasets. In our
end-to-end experiments, we set $k = 0.5p$ accordingly.

5.3 Active learning decision latency 

The time taken by the active learner to retrain a model and 
select a new batch of points after the previous batch has 
been labeled has a significant impact on full-run latency, 
because the labeling process blocks until the learner is ready 
with the next batch. To mitigate this latency (which is not 
an issue for passive learning), CLAMShell uses two known 
techniques. 

First, rather than consider all unlabeled points for selection 
in the next batch, we consider only a uniform random sam¬ 
ple of the points. This has been shown to have little impact 
on active learning convergence, and offers significant perfor¬ 
mance improvements: the point selection time is linear in 
the sample size, not the size of the entire unlabeled dataset, 
which might include millions of examples. 
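
A minimal Python sketch of this subsampling step is shown below. The
function name and the `score` callback (an uncertainty score, higher
meaning more uncertain) are hypothetical stand-ins and not part of
CLAMShell; the sketch assumes `unlabeled` is a sequence such as a list
or a `range` so that it can be sampled without scanning every element.

```python
import random


def select_from_sample(unlabeled, score, batch_size, sample_size, rng=None):
    """Choose the next active batch from a uniform random sample.

    unlabeled: sequence of candidate points (list, tuple, range, ...).
    score: callable returning an uncertainty score for one point.
    batch_size: number of points to return.
    sample_size: number of candidates that are actually scored.
    """
    rng = rng or random.Random()
    n = min(sample_size, len(unlabeled))
    sample = rng.sample(unlabeled, n)  # uniform, without replacement

    # score() is called once per sampled point, so the selection cost
    # depends on sample_size rather than on len(unlabeled).
    scored = [(score(x), x) for x in sample]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [x for _, x in scored[:batch_size]]
```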

Second, rather than performing retraining and selection syn¬ 
chronously at the end of each batch, CLAMShell continu¬ 
ally retrains models asynchronously on the latest available 
points. A new batch of points is selected based on each 
new model, so at any point in time there is an available 
model and an available selection of points for the next batch. 
When each batch of points completes the labeling process, the next
batch is selected based on the most recently computed model. This
trades off decision latency for staleness of points to be selected,
and empirically we find that it does not significantly impact model
convergence.

Table 2: CLAMShell techniques (AL: Active Learning).

Param    Description
$PM_t$   Latency threshold for pool maintenance.
SM       Straggler mitigation: on (SM), off (NoSM).
$N_p$    Number of workers in the retainer pool.
$N_g$    Task complexity: # records grouped in a HIT.
         Simple (1), Medium (5), Complex (10 records).
$R$      Pool-batch ratio.
Alg      Learning algorithm: active (AL), passive (PL),
         hybrid learning (HL), or none (NL).

Table 3: Experimental Parameters
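
The asynchronous retraining scheme of Section 5.3 can be sketched in
Python with a background thread that keeps a "latest" model and
candidate batch available at all times. This is an illustration under
stated assumptions, not the CLAMShell implementation: the class name,
the `train_fn` and `select_fn` callbacks, and the polling interval are
all hypothetical.

```python
import threading


class AsyncSelector:
    """Keeps the most recently computed (model, next batch) pair."""

    def __init__(self, train_fn, select_fn):
        self._train_fn = train_fn      # labels -> model (hypothetical)
        self._select_fn = select_fn    # model -> list of points (hypothetical)
        self._lock = threading.Lock()
        self._labels = []
        self._latest = None            # result of the last finished retrain
        self._new_labels = threading.Event()
        self._stop = threading.Event()
        self._thread = None

    def add_labels(self, labels):
        """Record newly acquired labels and wake the retraining loop."""
        with self._lock:
            self._labels.extend(labels)
        self._new_labels.set()

    def retrain_once(self):
        """Train on a snapshot of the labels and publish a new batch."""
        with self._lock:
            snapshot = list(self._labels)
        model = self._train_fn(snapshot)   # slow work happens outside the lock
        batch = self._select_fn(model)
        with self._lock:
            self._latest = (model, batch)

    def next_batch(self):
        """Return the latest batch without waiting (None before any retrain)."""
        with self._lock:
            return None if self._latest is None else self._latest[1]

    def _loop(self):
        while not self._stop.is_set():
            if self._new_labels.wait(timeout=0.1):
                self._new_labels.clear()
                self.retrain_once()

    def start(self):
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def stop(self):
        """Stop the background loop; only valid after start()."""
        self._stop.set()
        self._thread.join()
```

Because `next_batch` never blocks on training, the batch it returns
may have been selected by a slightly stale model, which is the
trade-off described above.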

5.4 Putting it all together 

CLAMShell is powered by three techniques: Straggler Mit¬ 
igation (straggler), Retainer Pool Maintenance (pool), and 
Hybrid Learning (hybrid). Table 2 summarizes their impact 
on system performance across four axes: (1) Can they im¬ 
prove the mean latency of labeling? (2) Can they mitigate 
the variance of individual workers’ labeling latency? (3) Do 
they require additional cost to use? (4) Are they general or 
restricted to a certain labeling setting? 

6. EVALUATION 

In this section, we evaluate CLAMShell both in simulation 
and on live crowd workers on MTurk in order to show that 
it enables data labeling to proceed at interactive speeds. We 
first evaluate each technique in isolation, then provide end- 
to-end experiments demonstrating the total time it takes 
to label unlabeled datasets. Table 3 summarizes important 
parameters varied in the experiments. 

6.1 Experimental Setup 

Simulator. The simulated experiments described in the previous text
and the following evaluation are run on a Python simulator that
models a retainer-pool crowd data labeler and implements uncertainty
sampling on top of scikit-learn's model training [44]. To simulate
crowd workers, we use traces from the medical deployment described in
Section 2.1. From each trace, we measure each worker's mean labeling
latency $\mu_i$, variance in labeling latency $\sigma_i^2$, and mean
accuracy $A_i$. We then generate a worker's latency on an assigned
labeling task by drawing a sample i.i.d. from
$\mathcal{N}(\mu_i, \sigma_i^2)$, and generate the label itself by
returning the correct label with probability $A_i$ and the incorrect
label with probability $1 - A_i$. Using these worker pools, the
simulator can model recruitment (adding random workers to the pool),
pool maintenance (releasing workers with high observed $\mu_i$ from
the pool), straggler mitigation (assigning multiple simulated workers
to the same task and returning the minimum of the sampled latencies),
and active learning (using simulated workers to label batches of
points and measuring the latency of the whole batch).
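
The worker model described above can be sketched in a few lines of
Python. This is not the simulator's actual code: the class and
function names are hypothetical, and two details are assumptions
beyond the text, namely that sampled latencies are clipped at zero
(a normal draw can be negative) and that, with more than two classes,
an incorrect label is chosen uniformly among the wrong ones.

```python
import random


class SimWorker:
    """A simulated crowd worker parameterized from a trace."""

    def __init__(self, mean_latency, std_latency, accuracy):
        self.mean_latency = mean_latency  # mu_i
        self.std_latency = std_latency    # sigma_i (square root of the variance)
        self.accuracy = accuracy          # A_i

    def latency(self, rng):
        # Draw from N(mu_i, sigma_i^2); clipping at 0 is our assumption.
        return max(0.0, rng.gauss(self.mean_latency, self.std_latency))

    def label(self, true_label, labels, rng):
        # Correct with probability A_i, otherwise a uniformly chosen wrong label.
        if rng.random() < self.accuracy:
            return true_label
        return rng.choice([l for l in labels if l != true_label])


def straggler_mitigated_latency(workers, rng):
    """Latency of one task replicated across several workers: the minimum."""
    return min(w.latency(rng) for w in workers)
```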

Live Experiments. The live experiments discussed below 
run on a custom implementation of the retainer model for 
MTurk. Recruitment occurs by repeatedly re-posting re¬ 
cruitment tasks every 3 minutes to MTurk until the desired 



number of workers have joined the pool. Workers are paid 
$.05 / minute to wait for available work once they join a 
pool, and $.02 / record to perform the work once it becomes 
available. MTurk tasks require a minimum qualification of 
85% worker approval to join a pool. Experiments in these 
pools were run at multiple times of day on both weekdays 
and weekends. In contrast with prior work, we found that 
results were remarkably consistent across these parameters 
when using our latency mitigation techniques. This may be 
the result of our relatively strict qualification requirement, 
or may reflect more systemic changes in the MTurk market¬ 
place. Following the retainer pool model, we assume recruit¬ 
ment time is amortized across batches and measure latency 
from the moment the first task is sent to the pool, rather 
than from the beginning of the recruitment process. Over¬ 
all, we collected timing results for nearly 250,000 individual 
task assignments over the span of several weeks. 
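
For reference, the payment scheme above implies a simple per-worker
cost calculation, sketched here in Python. The function name is
hypothetical, the default rates are the ones quoted in the text, and
the sketch ignores platform fees and any other charges, which the
text does not describe.

```python
def retainer_cost(wait_minutes, records_labeled, wait_rate=0.05, record_rate=0.02):
    """Dollars paid to one worker: waiting pay plus per-record pay."""
    return wait_rate * wait_minutes + record_rate * records_labeled
```

Under this model, a run that finishes sooner pays less waiting time
for the same number of labeled records.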

Datasets. The active learning tasks run in this evalua¬ 
tion are all classification tasks, based on publicly available 
datasets. The MNIST dataset [33] contains 70,000 black and 
white images of handwritten digits, and the multi-class clas¬ 
sification task is to detect which digit an image represents. 
We used raw pixel values as features, leading to 784 features 
per image. The CIFAR-10 dataset [31] contains 60,000 color 
images of various objects, and the classification task is to 
identify the category of the primary object in each image. 
In order to make the learning task simpler, we limited the 
topic categories to two: “Birds” and “Airplanes”. We used 
raw pixel values as features, generating 3072 features per 
image. In addition to the real datasets, which have concrete 
labeling tasks that we can send to human workers, we also 
generate datasets of varying difficulty to illustrate the rela¬ 
tionship between problem hardness and the performance of 
our techniques. These datasets are generated with the scikit- 
learn data generator, which builds classification problems 
following an adaptation of the algorithm from [19]. 

6.2 Pool Maintenance 

In this section, we evaluate the effects of pool maintenance 
on batch time. The experiments execute 500 tasks that label 
MNIST digit images. We compare tasks of varying complexity (Simple,
Medium, Complex) that use $N_g = 1$, 5, or 10 MNIST images,
respectively. The latency threshold is set to $PM_t = 8$ and
$PM_t = \infty$ (no maintenance).

Figure 3: # points labeled over time.


Figure 3 is an overview of the total number of labeled points
($N_g \times N_{tasks}$) over time for each configuration. The slope
of each curve describes the speed of task completion, where a flat
curve denotes stragglers that take a very long time to complete a
task. We find that task completion for simple tasks is uniformly
fast, so pool maintenance provides little additional benefit;
however, more complex tasks are affected by outliers, and
maintenance's ability to cull slow workers helps reduce the presence
of very long tasks.

Figure 4: Summary of end-to-end cost and latency experiments with and
without pool maintenance.

Overall. Ultimately, pool maintenance does not improve end-to-end
latency for simple tasks significantly, but is able to reduce the
latency for medium and complex tasks by 1.3x and 1.8x on average,
respectively (Figure 4). Interestingly, despite its added cost to
recruit workers concurrently with labeling tasks, maintenance is able
to reduce the overall cost of the medium and complex tasks by 7-16%.
This is due to finishing the experiment faster and saving the cost of
paying workers to stay in the retainer pool. Changing the rate paid
to waiting workers may increase or reduce this effect.

Figure 5: Comparison between age of the worker in the pool when
starting a given task and the time to complete the task. Tasks where
the latency per labeled point is greater than 8 seconds are colored
in blue. (Panels: Complex, Medium, Simple.)

Latency Distribution. To better understand how pool maintenance
affects the composition of the worker pool, Figure 5 plots task
completion speeds against the age of the worker when starting a given
task. We define a worker's age with respect to task $t_i$ as the
number of tasks the worker has already completed in the experimental
run. The y-axis shows the latency to acquire a single label, computed
as $\frac{\text{task latency}}{N_g}$; each column shows all tasks
across the runs for a given task complexity; and the top and bottom
rows are with maintenance turned on ($PM_8$) and off ($PM_\infty$).
In addition, the points are categorized as fast (< 4 sec per label),
medium (5-7 sec), or slow (> 8 sec). Although workers that are new to
the worker pool naturally exhibit high task latency variability,
maintenance is able to purge the slow workers over time. For every
task complexity, the slow and even medium latency tasks are nearly
all removed once workers have remained in the pool for more than 4
minutes. In contrast, the lack of pool maintenance allows slow and
highly variable workers to continue working on tasks, so that slow
tasks are seen throughout the entire experiment.

Figure 6: Mean pool latency over time.


Mean Pool Latency. Figure 6 provides a different view on 
pool maintenance’s effects on the worker pool: it measures
the mean pool latency (MPL) for each batch of tasks sent to 
the pool throughout the experiment. MPL is computed as 
the average latency of all completed tasks in the pool. Each 
subplot compares the MPL with and without maintenance 
for a given experimental run and task complexity. While the 
average of each pair of curves is similar, pool maintenance 
shows significantly less variance across the batches because 
it effectively removes the long tail of the latency distribution. 
The variation in the pool maintenance curve is simply due 
to the variation of the newly recruited workers. 



Figure 7: The number of workers replaced over time for varying
maintenance latency thresholds. (x-axis: Time (sec); series: Run 0,
Run 1.)


Latency Threshold. Our analysis of MPL shows that pool maintenance is
able to remove outliers from the worker pool. However, the reduction
in MPL is not as fast as predicted by the model or simulations
presented in Section 4.2. This is expected, as workers may not
maintain consistent speed over time, and our empirical estimates of
workers' speed may be inaccurate. Another potential issue may be that
our latency threshold is poorly tuned. Thus, in our final experiment
(Figures 7 and 8), we study whether varying the latency threshold
between 2 and 32 seconds can affect the median task latency in
addition to the variance. Figure 7 demonstrates that decreasing the
threshold causes more workers to be replaced during a run, as
expected. Figure 8 shows the latency percentiles at different
worker-age slices (e.g., < 5 tasks) in the experiment. We find that
varying the threshold affects both the median and higher percentiles,
with a more pronounced effect on the extreme task latencies. For this
workload, the optimal threshold is $PM_8$, which can reduce the
straggler latencies by nearly 2x. However, further reducing the
threshold to 4 or 2 seconds goes beyond the point where even fast
workers are able to complete tasks, and effectively replaces all
workers with the mean of the underlying MTurk distribution. The
curves decrease across worker-age slices due to the effects of pool
maintenance, consistent with the analysis in Figure 5.

Figure 8: 50th, 95th, and 99th percentiles of task latency as
maintenance latency threshold varies. Each facet is a different
amount of time into the experiment.

Figure 9: Straggler mitigation dramatically reduces the standard
deviation of per-task latency across batches.
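
The threshold-based replacement evaluated in this section can be
sketched as a simple filter over observed per-label latencies. The
function below is a hypothetical illustration, not the policy's
actual implementation: it assumes workers are judged on the plain
mean of their observed latencies and that workers with no
observations yet are kept.

```python
def maintain_pool(task_latencies, threshold):
    """Split workers into those kept and those released.

    task_latencies: dict mapping worker id -> list of observed
        per-label latencies in seconds.
    threshold: latency threshold PM_t in seconds (float('inf') disables
        maintenance).
    """
    keep, release = [], []
    for worker, latencies in task_latencies.items():
        if not latencies:
            keep.append(worker)  # no evidence yet (our assumption)
            continue
        mean_latency = sum(latencies) / len(latencies)
        if mean_latency > threshold:
            release.append(worker)
        else:
            keep.append(worker)
    return keep, release
```

Released workers would then be replaced by newly recruited ones, so a
lower threshold translates directly into a higher replacement rate.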


6.3 Straggler Mitigation 

In this section, we evaluate the performance of straggler mitigation
along two key metrics: task latency and task variance. An important
parameter of straggler mitigation is $R$, the ratio of workers in the
pool to tasks in a batch (Table 3), because it controls how many
workers are assigned on average to eliminate stragglers. Set too low,
and stragglers will occur unfettered. Set too high, and money and
effort will be wasted unnecessarily. In these experiments, we set
task complexity to $N_g = 5$, the pool size to $N_p = 15$, and give
workers CIFAR-10 tasks.

Variance. One of the key properties of straggler mitigation is its
ability to reduce the variance of individual task latencies. Figure 9
plots the standard deviation of task completion times for each batch.
Straggler mitigation consistently decreases the standard deviation by
5 to 10x (a decrease in variance of up to 100x!), which is very
important when trying to predict the run-time of a batch
consistently. One interesting observation is the jaggedness of the
$R = 3$ plots. This is likely because with 3 times as many workers as
tasks, workers spend much more time waiting, and are slow to respond
when work becomes available because they are involved in other work.
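
The variance figure quoted above follows directly from the standard
deviation figure. Writing $\sigma$ for the per-task standard
deviation without straggler mitigation and $c$ for the factor by
which mitigation reduces it (symbols introduced here for
illustration):

```latex
\frac{\mathrm{Var}_{\text{NoSM}}}{\mathrm{Var}_{\text{SM}}}
  = \frac{\sigma^2}{(\sigma / c)^2}
  = c^2,
\qquad
c = 5 \Rightarrow 25\times,
\quad
c = 10 \Rightarrow 100\times .
```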

Latency. Because straggler mitigation enables task batches to finish
without waiting for high-latency straggler task assignments to
complete, it significantly reduces the latency of each batch, up to
5x on some runs (Figure 10). Increasing $R$ can increase those gains,
but comes at an additional cost, as it pays more workers to complete
each task. Although intuitively we might expect straggler mitigation
to become more and more effective as $R$ increases, there are
practical limitations that prevent this effect. With high $R$, even
fast workers are often terminated before finishing their tasks
because many workers are working on every task at once. In addition
to the added latency of this termination (workers must click a dialog
to finish the old task and be presented with a new one, which takes
seconds), this creates a frustrating environment for workers, who
feel as though they aren't being allowed to work. As a result,
keeping $R$ between 0.75 and 1 is attractive, as it limits cost and
still shows impressive speedups.

Figure 10: Points labeled over time with straggler mitigation.

Figure 11: Straggler mitigation increases costs by 1 to 2x, improves
latency by 2.5-5x, and variance by 4-14x.

6.4 Combining Per-Batch Techniques 

Figure 12 summarizes the effects of combining both straggler 
mitigation and pool maintenance when labeling CIFAR-10 
tasks. We see that the two techniques can be complemen¬ 
tary, but in some experiments we observe destructive inter¬ 
ference between straggler mitigation and pool maintenance. 
We believe this is a result of fluctuating conditions on the un¬ 
derlying crowd platform across experiments: sometimes the 
initial pool selection is high-quality, rendering pool main¬ 
tenance ineffective, and other times very slow workers join 
the pool and maintenance is invaluable. We note that in all cases,
combining per-batch techniques still results in a significant speedup
over not using either technique, leading to a reduction in latency of
up to 6x, and reduction in standard deviation of up to 15x.

Figure 12: End-to-end Latency, Variance, and Costs for different
straggler mitigation and pool maintenance configurations.

Detailed View. Figure 13 shows the latency of every task 
for a single experimental run with every combination of 
straggler mitigation and pool maintenance. Each line seg¬ 
ment depicts the start and end time of a specific task. Red 
tasks are successfully completed, while blue tasks are termi¬ 
nated due to the worker leaving the pool or because another 
worker finished the task in less time. Red and blue dots de¬ 
note the start and end of a batch, and the tasks completed by 
a given worker are aligned vertically along the y-axis. 

The top two subplots show the value of pool maintenance - 
although stragglers are still present under pool maintenance, 
there are considerably fewer and lower magnitude stragglers 
as compared to the baseline pool. The bottom two subplots 
show that maintenance can further improve straggler mit¬ 
igation by reducing the number of stragglers that must be 
ameliorated.

Figure 14: Replacement rate when using TermEst (Section 4.3) with
$\alpha = 1$.

Effect of TermEst. Figure 14 measures the effectiveness of our model
for estimating the latency of terminated tasks (Section 4.3). We see
that, as expected, without TermEst, the worker replacement rate
decreases dramatically, because workers are estimated to be faster
than $PM_t$ and are not replaced. Adding TermEst adjusts for the gap:
with it turned on, replacement happens just as frequently as with no
straggler mitigation.

6.5 Hybrid Learning 

In this section, we evaluate our hybrid learning strategy, 
demonstrating that it is effective on datasets where either 
active or passive learning would perform better, and that 
it successfully takes advantage of pool parallelism to reduce 
the time required to train a good model. 

Accuracy. The Hybrid algorithm depends on the assump¬ 
tion that active learning does not outperform passive learn¬ 
ing in all settings. Figure 15 validates this assumption in 
our simulator. It plots learning curves for active and pas¬ 
sive learning on generated datasets of increasing hardness 
(rows show number of generated features), and shows how 
each learner performs given different amounts of the crowd’s 
resources (columns show the percentage of the crowd pool 
used for active learning). On easier datasets, active learning 
significantly outperforms passive learning, but when given 
as many resources as active learning, passive learning is the 
better choice on harder learning tasks where active point 
selection is ineffective. This reinforces our belief that a suc¬ 
cessful hybrid strategy can trade off between the two ap¬ 
proaches, and the hybrid lines in both Figure 15 and Fig¬ 
ure 16 (wherein we replicate the simulator results on real- 
world datasets with live workers) demonstrate that the strat¬ 
egy is indeed successful. In all cases, hybrid performs as well 
as or better than either active or passive learning.

Figure 13: Per-assignment view of each straggler mitigation and churn
configuration. Each horizontal segment is the length of an
assignment. Red and blue dots denote batch boundaries.

Figure 15: Active, Passive, and Hybrid strategies for learn-

Figure 16: Active, Passive, and Hybrid strategies for learning on
crowds run on real-world datasets on live workers.

Latency savings. Because hybrid learning leverages the full
parallelism of the crowd (as opposed to active learning with a
limited batch size), the hybrid learning strategy is able to train
better models faster. Figure 16 shows the hybrid learning strategy's
performance compared to pure active or pure passive over time on the
MNIST and CIFAR datasets. The x and y axes of each plot show the
accuracy improvement over time as points are labeled, the rows depict
the datasets, and the columns represent the setting of the AL batch
size as a percentage of the crowd pool size. In the same amount of
time, the hybrid strategy is always the preferred solution for model
training. In fact, on average, hybrid trains models of 85% accuracy
on CIFAR (70% accuracy on MNIST) 1.2x (1.7x) faster than pure active
learning and 1.6x (1.2x) faster than pure passive learning.

6.6 End-to-End Evaluation 

In this section, we evaluate the end-to-end performance of 
CLAMShell against two baselines. Base-NR, which repre¬ 
sents a typical crowd labeling deployment, sends labels out 
all at once, uses no retainer pool, and trains passive learn¬ 
ing models to infer labels for unlabeled records. Base-R, 
which leverages the latest techniques for low-latency crowd¬ 
sourcing, uses a retainer pool to label points in batches and 
active learning to infer labels for unlabeled records. In this 
experiment, 500 points were labeled by each strategy on the
CIFAR-10 and MNIST datasets, and the accuracy of the
resulting models was measured.

Results. Figures 17 and 18 summarize the results of this 
evaluation. In Figure 17, the rows represent an accuracy 
threshold for the model, and the plots show the wall-clock
time taken by each strategy to train a model of that accuracy.
Note that neither baseline reaches an accuracy of 80%
on the MNIST dataset in 500 points. To reach an accuracy
of 75%, CLAMShell requires 4 to 5x less time than Base-NR.
Figure 18 displays the full learning curves for each strategy, 
demonstrating that CLAMShell dominates both baselines in 
terms of model accuracy. 

We also measured the raw time to acquire 500 labels from the crowd,
and found that CLAMShell increases the labeling throughput by 7.24x
compared to Base-NR. In addition, CLAMShell reduces the variance of
labeling by 151x, and the absolute values are extremely low: 3.1
seconds vs. 475 seconds.

Figure 17: Summary of end-to-end time to reach model accuracy

Figure 18: Wall clock time vs Model Accuracy

7. CONCLUSION & FUTURE DIRECTIONS 

In summary, we have introduced CLAMShell, a system for 
data labeling that acquires labels from human crowd workers
at interactive speeds. Latency can arise from many
points in the labeling lifecycle, and CLAMShell addresses 
the key sources of latency with novel techniques. Straggler 
mitigation reduces the variance of task latencies within a 
batch by assigning additional workers to complete the task. 
Pool maintenance increases the average speed of workers in 
a labeling pool by replacing slow workers with faster ones 
over time. Hybrid learning reduces end-to-end labeling time 
by combining the fast convergence of active learning with 
the parallelism of passive learning. The result is an impor¬ 
tant step towards integrating data labeling with interactive 
systems for data analysis. 

Though CLAMShell takes a comprehensive approach to la¬ 
tency reduction for data labeling, there are a number of 
directions in which this work can be extended. First, we 
would like to explore richer objective functions than mean 
worker speed for pool maintenance in order to strike a bal¬ 
ance between worker speed, variance and quality. In addi¬ 
tion, hybrid learning simply trains a single model on the 
points labeled by active and passive learners. We would 
like to investigate whether better models can be trained by 
keeping the points separate and using more sophisticated 
machine learning techniques such as model averaging or ensembling.
Finally, we are integrating CLAMShell with an
interactive data cleaning system [20] in order to learn how 
it performs with application-driven latency constraints on a 
wider range of crowd tasks.